Systems and methods for matching records using geographic proximity

ABSTRACT

Contact objects in one or more databases can be matched using various systems and methods to determine geographical proximity between the objects. Location attributes associated with first and second objects can be compared to determine a distance between the locations associated with the objects. The objects can then be grouped if the distance is less than a threshold distance.

This application claims the benefit of U.S. provisional application No. 61/658,498, filed on Jun. 12, 2012. This and all other referenced extrinsic materials are incorporated herein by reference in their entirety. Where a definition or use of a term in a reference that is incorporated by reference is inconsistent or contrary to the definition of that term provided herein, the definition of that term provided herein is deemed to be controlling.

FIELD OF THE INVENTION

The field of the invention is data processing, and specifically data processing systems and methods for matching records.

BACKGROUND

It is a common goal for data processors to remove duplicate records from a database of records (e.g., customers' contact information), as duplicate records provide inaccurate information, and can result in wasted mailing costs and customer dissatisfaction.

In the past, duplicate records were uncovered using a “brute force” algorithm, where each record is compared to every other record in a database. For example, a database having ten records would require forty-five comparisons. Adding an additional record to the database would require ten additional comparisons, and adding another record would require eleven additional comparisons, and so forth. Although comparisons can be done very quickly with today's computers, the sheer number of comparisons required even for small databases (e.g., one million records) can easily exceed practical time spans. For example, in the case of one million records, roughly half a trillion comparisons would be required.
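The growth described above can be checked with a short sketch. The function name is hypothetical; it simply counts the unordered pairs among n records, n(n − 1)/2, which is the number of comparisons a brute force pass needs when each pair is compared once. Note that one million squared is about one trillion, while counting each pair only once gives about half of that.

```python
def brute_force_comparisons(n: int) -> int:
    """Number of unordered record pairs among n records: n * (n - 1) / 2."""
    if n < 0:
        raise ValueError("record count must be non-negative")
    return n * (n - 1) // 2


ten = brute_force_comparisons(10)
eleven = brute_force_comparisons(11)
million = brute_force_comparisons(1_000_000)

print(ten)           # 45
print(eleven - ten)  # 10 additional comparisons for the eleventh record
print(million)       # 499999500000
```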

To reduce the amount of processing time required, it is known to first cluster records that share a certain attribute. For example, a database of records could be clustered by the first digit of each record's zip code, creating ten clusters. Each record in the cluster is then compared to every other record in the cluster using a “brute force” algorithm. Although this process reduces processing time, the process is incomplete because records in one cluster are not compared with records in other clusters. Thus, if a record in cluster A were to match another record in cluster B, the match would not be found.
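The limitation of clustering can be illustrated with a minimal sketch. The record layout (an "id" and a "zip" string) and the sample values are hypothetical and are not taken from any real database; the point is only that pairs are generated within a cluster and never across clusters.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs_by_zip_digit(records):
    """Cluster records by the first zip digit; pair records only within a cluster."""
    clusters = defaultdict(list)
    for rec in records:
        clusters[rec["zip"][0]].append(rec["id"])
    pairs = set()
    for ids in clusters.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


# Hypothetical records: 1 and 2 share a first zip digit, 3 does not.
records = [
    {"id": 1, "zip": "02113"},
    {"id": 2, "zip": "02110"},
    {"id": 3, "zip": "12345"},
]
pairs = candidate_pairs_by_zip_digit(records)
print(pairs)  # only the pair (1, 2) is ever compared
```

Record 3 is never compared with records 1 or 2, so a true duplicate carrying a mistyped zip code would go undetected under this scheme.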

Various other processes of detecting duplicate records are described in the art. See, e.g., U.S. Pat. No. 6,374,241 to Lamburt et al.; U.S. pat. publ. no. 2005/0273452 to Molloy, et al. (publ. December 2005); U.S. pat. publ. no. 2011/0191353 (publ. March 2011); U.S. pat. publ. no. 2012/0059853 (publ. March 2012); WIPO publ. no. 00/34897 to Bloodhound Software, Inc. (publ. June 2000); and WIPO publ. no. 2009/132263 to Lexis-Nexis Risk & Information Analytics Group, Inc. (publ. October 2009). However, all the processes known to Applicants are also incomplete and fail to appreciate that geographical proximity between records can be used to determine whether the records are duplicates.

Thus, there is still a need for efficient systems and methods that match records using geographical proximity.

SUMMARY OF THE INVENTION

The inventive subject matter provides apparatus, systems and methods in which one can match records in one or more contact databases using geographical proximity. The one or more contact databases can collectively include a plurality of contact objects, each of which has an associated location attribute.

To determine whether duplicates exist among the plurality of contact objects, a location attribute of a first contact object can be compared to a location attribute of a second contact object to determine a distance between the first and second contact objects using a matching engine. If the distance is less than a threshold distance, a group identification number can be associated with the first and second contact objects. Any additional records that are found to match the records associated with that group identification number can also be associated with that group identification number.

Unless the context dictates the contrary, all ranges set forth herein should be interpreted as being inclusive of their endpoints, and open-ended ranges should be interpreted to include commercially practical values. Similarly, all lists of values should be considered as inclusive of intermediate values unless the context indicates the contrary.

Various objects, features, aspects and advantages of the inventive subject matter will become more apparent from the following detailed description of preferred embodiments, along with the accompanying drawing figures in which like numerals represent like components.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a flowchart illustrating one embodiment of a method for matching records in a database.

FIG. 2 is a flowchart illustrating another embodiment of a method for matching records in a database using a centroid of the contact objects of a group.

FIG. 3A is a diagram of one embodiment of a system for matching records in a database.

FIG. 3B is a diagram of one embodiment of a system for matching records in a database.

FIG. 4 is a diagram of a center point of a set of records associated with the same group identification number.

DETAILED DESCRIPTION

It should be noted that while the following description is drawn to a computer/server based data processing system, various alternative configurations are also deemed suitable and may employ various computing devices including servers, interfaces, systems, databases, agents, peers, engines, controllers, or other types of computing devices operating individually or collectively. One should appreciate the computing devices comprise a processor configured to execute software instructions stored on a tangible, non-transitory computer readable storage medium (e.g., hard drive, solid state drive, RAM, flash, ROM, etc.). The software instructions preferably configure the computing device to provide the roles, responsibilities, or other functionality as discussed below with respect to the disclosed apparatus. In especially preferred embodiments, the various servers, systems, databases, or interfaces exchange data using standardized protocols or algorithms, possibly based on HTTP, HTTPS, AES, public-private key exchanges, web service APIs, known financial transaction protocols, or other electronic information exchanging methods. Data exchanges preferably are conducted over a packet switched network, the Internet, LAN, WAN, VPN, or other type of packet switched network.

The following discussion provides many example embodiments of the inventive subject matter. Although each embodiment represents a single combination of inventive elements, the inventive subject matter is considered to include all possible combinations of the disclosed elements. Thus if one embodiment comprises elements A, B, and C, and a second embodiment comprises elements B and D, then the inventive subject matter is also considered to include other remaining combinations of A, B, C, or D, even if not explicitly disclosed.

One should appreciate that the disclosed techniques provide many advantageous technical effects including increasing the efficiency of data processing of one or more databases to identify matches among the records in the one or more databases. By using geographical proximity, records can be matched that have non-identical fields, and that might otherwise have been missed by conventional processes.

In FIG. 1, an embodiment of a method 100 for matching contact objects in a database is shown using geographical proximity. Method 100 preferably includes step 110 of providing access to at least one contact database configured to store a plurality of contact objects. Each of the contact objects can have an associated location attribute. Although the location attribute can include postal address information such as an address, a city, a state, and a zip code including zip+4, it is preferred that the location attribute comprises geographical coordinates such as latitude and longitude. It is contemplated that such information can be user-provided or derived from the contact object using software or some other source (e.g., a geo-location device or a geo-point table). Exemplary software includes GeoCoder Object™ sold by Melissa DATA™, although any commercially suitable software could be used.

The method 100 can further include step 120 of providing access to a matching engine that is communicatively coupled to the at least one contact database. The matching engine can advantageously be used to match contact objects within or among one or more databases and thereby uncover duplicate objects. In its simplest form, the plurality of records can be matched by comparing each record in the one or more databases with every other record in the one or more databases. Although there are optimizations to this process that can be applied, the following discussion is based on this simplification.

Each of the contact objects can have various attributes including, for example, a last name, a first name, one or more addresses (e.g., city/state/zip and/or geographical coordinates), a phone number, an email address, contact preferences, and so forth. Because of the breadth of data often associated with each record, it is common practice to use only a subset of a database's fields in the matching process. Common field types used for matching include, for example, a first name, a last name, a street address, a phone number, a company name, and so forth.

A user can specify the conditions of determining whether records match by defining a set of matching rules, such as by using a matching interface. For example, using the matching interface, a user might define the matching rules to include the parameters “first name”, “last name”, and “located within 10 miles (16.09 kilometers)”. In another example, the user might define the matching rules to include the parameters “phone number”, “last name” and “located within 15 miles (24.14 kilometers)”. In still another example, the user might instead define the matching rules to include the parameters “company name”, “city” and “located within 5 miles (8.047 kilometers)”. It is contemplated that the matching interface could utilize drop-down menus or text-based inputs, for example, to allow a user to define the matching rules.

In step 130, the matching engine can be programmed or otherwise configured to compare first and second location attributes of first and second contact objects, respectively, to determine a geographical distance between the first and second location attributes. An exemplary matching engine is MatchUp® offered by Melissa DATA™, although any commercially suitable software could be used.

In some contemplated embodiments, the distance between the first and second contact objects can be determined by calculating the great circle distance between the objects. The great circle distance is calculated using the spherical law of cosines, as shown in the following formula:

D = R * arccos(sin(latitude₁) * sin(latitude₂) + cos(latitude₁) * cos(latitude₂) * cos(longitude₁ − longitude₂))

where D is distance, latitude₁ and longitude₁ are the coordinates associated with a first record, latitude₂ and longitude₂ are coordinates associated with a second record, and R is the radius of the Earth, approximately 3963.16 miles (6378.1 kilometers). Of course, other distance measurements could be used without departing from the scope of the inventive subject matter described herein.
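As an illustration only, the great circle calculation can be sketched in Python as follows. The function and constant names are hypothetical, coordinates are assumed to be in decimal degrees, and the final cosine term uses the difference in longitude, as the standard spherical law of cosines requires. The clamp before the arccosine is an added safeguard against floating-point rounding and is not part of the formula itself.

```python
import math

EARTH_RADIUS_MILES = 3963.16  # radius R as given in the text


def great_circle_miles(lat1, lon1, lat2, lon2):
    """Great circle distance in miles via the spherical law of cosines.

    All four arguments are decimal degrees.
    """
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    delta_lon = math.radians(lon1 - lon2)
    cos_angle = (math.sin(phi1) * math.sin(phi2)
                 + math.cos(phi1) * math.cos(phi2) * math.cos(delta_lon))
    # Clamp so rounding error cannot push the value outside acos's domain.
    cos_angle = max(-1.0, min(1.0, cos_angle))
    return EARTH_RADIUS_MILES * math.acos(cos_angle)


# Coordinates of Objects 1 and 2 from Table 1 (both in Boston, MA).
d12 = great_circle_miles(42.36371, -71.055833, 42.354136, -71.055231)
print(round(d12, 2))  # about two-thirds of a mile
```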

In step 140, a decision is made, preferably via the matching engine, whether the geographical distance between the first and second contact objects is calculated to be less than a threshold distance. If the geographical distance is less than the threshold distance, the matching engine or other component preferably automatically assigns or otherwise associates a group identification number with each of the first and second contact objects in step 150.

The threshold distance is preferably user defined prior to initiating the matching process. Contemplated threshold distances include, for example, 1 mile (1.609 kilometers), 5 miles (8.047 kilometers), 10 miles (16.09 kilometers), 15 miles (24.14 kilometers), 20 miles (32.19 kilometers), and so forth, although it is further contemplated that the user could manually input a threshold distance different from these distances.

It is contemplated that the matching engine could stop searching for additional matches to the objects assigned a group identification number once an initial match is found between those objects. However, such an approach could result in missed matches. For example, in Example 1 below, this approach would not have matched Object 3 with Object 1 despite the objects being within the threshold distance.

Alternatively, any additional objects found to be matching with an object of a group would also be associated with that same group identification number. For example, as shown in FIG. 2, step 210 can involve selecting a third contact object, which has a third location attribute, from the plurality of contact objects. The first and third location attributes can be compared using the matching engine in step 220 to determine a geographical distance between the first and third location attributes. In step 230, if the geographical distance between the first and third contact objects is less than the threshold distance, then in step 240 the first group number can be associated with the third contact object. However, such an approach can result in situations where many or all of the contact objects are found to be within the threshold distance of at least one of the other contact objects within a group, and are thus all determined to be matching.

To potentially avoid this issue, it is contemplated that the threshold distance can be manually altered and used as an optimization tool of the method, whereby the threshold distance can be set prior to a matching process based on knowledge of the geographic area covered by the plurality of contact objects and/or the number of contact objects that are being matched.

In other contemplated embodiments, all additional matching objects can be compared to the first contact object, and the distance between each of the matching contact objects and the first contact object must be less than the threshold distance for each object to match and be assigned the group identification number associated with the first object. While useful, such embodiments would likely be very sensitive to the order in which the contact objects are evaluated.
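The order sensitivity noted above can be seen in a minimal sketch, which assumes the coordinates of Table 1 in Example 1 below and a 5 mile threshold. The function names are hypothetical, and starting a new group for any object that does not match an existing first object is an assumption of this sketch rather than a stated requirement.

```python
import math

EARTH_RADIUS_MILES = 3963.16

# (latitude, longitude) of Objects 1-6 from Table 1.
TABLE_1 = {
    1: (42.36371, -71.055833),
    2: (42.354136, -71.055231),
    3: (42.33955, -71.135625),
    4: (42.307091, -71.216475),
    5: (42.300192, -71.266785),
    6: (42.304363, -71.326819),
}


def great_circle_miles(a, b):
    """Spherical law of cosines distance between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(a[0]), math.radians(b[0])
    delta_lon = math.radians(a[1] - b[1])
    c = (math.sin(phi1) * math.sin(phi2)
         + math.cos(phi1) * math.cos(phi2) * math.cos(delta_lon))
    return EARTH_RADIUS_MILES * math.acos(max(-1.0, min(1.0, c)))


def group_by_first_object(order, points, threshold):
    """Join a group only if within the threshold of that group's first object."""
    groups = []
    for obj_id in order:
        for group in groups:
            if great_circle_miles(points[obj_id], points[group[0]]) < threshold:
                group.append(obj_id)
                break
        else:
            groups.append([obj_id])  # assumption: unmatched objects seed a new group
    return groups


forward = group_by_first_object([1, 2, 3, 4, 5, 6], TABLE_1, 5.0)
backward = group_by_first_object([6, 5, 4, 3, 2, 1], TABLE_1, 5.0)
print(forward)
print(backward)
```

Under these assumptions, evaluating the objects in ascending order groups them as {1, 2, 3}, {4, 5}, {6}, whereas evaluating them in descending order groups them as {6, 5}, {4, 3}, {2, 1}.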

In still other contemplated embodiments, and as depicted in step 220, a centroid or center point of the contact objects having the same associated group identification number can be calculated by averaging the latitude points and the longitude points of the objects in that group. This centroid can be recalculated each time a new contact object is added/associated with the group to ensure that the centroid is accurate. To determine the average latitude and longitude values, it is contemplated that a centroid value, a total latitude value, a total longitude value, and a count of the number of objects in the group can be stored in the contact database or other location.

When a new contact object is associated with the group, the latitude and longitude attributes associated with that contact object can be added to the respective total latitude and longitude value, and the count of the number of contact objects in that group can be incremented by one. A new centroid value can then be determined by dividing each of the total latitude value and the total longitude value by the number of objects (count) value.
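The running totals described above might be kept as in the following sketch. The class and attribute names are hypothetical, and storage in the contact database is replaced here by a simple in-memory object. The sample coordinates are those of Objects 1 through 3 in Table 1.

```python
from dataclasses import dataclass


@dataclass
class GroupCentroid:
    """Running totals from which a group's centroid is derived."""
    total_lat: float = 0.0
    total_lon: float = 0.0
    count: int = 0

    def add(self, lat: float, lon: float) -> None:
        """Add one contact object's coordinates to the totals."""
        self.total_lat += lat
        self.total_lon += lon
        self.count += 1

    @property
    def centroid(self):
        """Average latitude and longitude of the objects added so far."""
        if self.count == 0:
            raise ValueError("an empty group has no centroid")
        return (self.total_lat / self.count, self.total_lon / self.count)


group = GroupCentroid()
group.add(42.36371, -71.055833)   # Object 1
group.add(42.354136, -71.055231)  # Object 2
centroid_of_two = group.centroid
group.add(42.33955, -71.135625)   # Object 3
centroid_of_three = group.centroid
print(centroid_of_two, centroid_of_three)
```

Keeping the totals and the count means each update costs a constant amount of work, regardless of how many objects the group already contains.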

After the plurality of records has been analyzed, each of the groups of records can be individually examined. It is contemplated that the matching engine or other component could be programmed or otherwise configured to analyze the contact objects associated with a group and delete or move duplicate objects, edit duplicate objects, merge duplicate objects, and so forth. For example, as previously discussed, certain matching rules may dictate comparing the “first name” and “last name” parameters, and whether the distance between the corresponding objects is less than a threshold distance. The matching engine, or another designated component, may determine whether a match satisfying the above conditions exists between object A and object B, and will either edit object B to conform with object A, merge the attributes of object B with object A, or completely delete object B, etc. In another example, certain matching rules may dictate a match if “company name” and “area of business” parameters of two given objects match, and if the distance between the same contact objects is less than a threshold distance. In this case, if a match is determined, then matching objects may similarly be deleted, merged, or edited to eliminate duplication within the related contact database. For example, company names could be normalized where necessary based on one or more matching rules. Such matching rules may be used to eliminate duplicate contact objects pertaining to businesses and/or other artificial entities.

FIGS. 3A and 3B illustrate an embodiment of a system 300 for matching contact objects 350A-N in a database using geographical proximity. The system 300 can include one or more contact databases 330A-N configured to store a plurality of contact objects 350A-N, where each object comprises a location attribute 360A-N. A matching engine 310 can be communicatively coupled to the contact database 330A-N, and programmed or otherwise configured to compare location attributes between a first and second contact object to determine a geographical distance between the first and second location attributes. If the geographical distance is less than a threshold distance, the matching engine 310 can associate a group number with each of the first and second contact objects.

The system 300 can further include a matching interface 320 configured to allow a user 340 to identify conditions to be used by the matching engine 310 to determine whether any given contact objects match. For example, the user 340 could specify the parameters of the objects to review, as well as the threshold distance.

EXAMPLE 1

A user desires to determine matches in a database of contact objects, which have the same first and last name and a geographical distance of less than 5 miles (8.047 kilometers). After conducting an analysis of objects sharing the same first and last names, a first duplicate group could contain the following objects:

TABLE 1

#   First Name  Last Name  Address         City       State  Latitude   Longitude
1.  John        Smith      82 Salem St     Boston     MA     42.36371   −71.055833
2.  John        Smith      36 High St      Boston     MA     42.354136  −71.055231
3.  John        Smith      1634 Beacon St  Brookline  MA     42.33955   −71.135625
4.  John        Smith      390 Needham St  Newton     MA     42.307091  −71.216475
5.  John        Smith      145 Central St  Wellesley  MA     42.300192  −71.266785
6.  John        Smith      9 Worcester St  Natick     MA     42.304363  −71.326819

Next, the latitude and longitude of the objects can be compared to determine the geographic proximity of the objects with respect to each other. For example, Objects 1 and 2 can be compared, and determined to have a geographical distance between them of 0.662 miles (1.065 kilometers). Because this distance is less than the threshold distance of 5 miles (8.047 kilometers), the objects are considered to be matching and are each associated with a first group identification number.

Objects 1 and 3 are then compared, and determined to have a geographical distance between them of 4.412 miles (7.1 kilometers). Because this distance is less than the threshold distance of 5 miles (8.047 kilometers), Object 3 is considered to be matching with Object 1, and is associated with the first group identification number. Object 4 is calculated to be 9.107 miles (14.66 kilometers) from Object 1, so it does not directly match Object 1. However, when Object 4 is compared with Object 3, it is determined to be 4.708 miles (7.577 kilometers) from Object 3, and thus a match with Object 3. Object 4 is then assigned the same group identification number as Object 3, which is also the same group identification number as Objects 1 and 2. Thus, in this example, Objects 1, 2, 3 and 4 are considered matches.

Following the same algorithm, Objects 4 and 5 are calculated to have a distance of 2.621 miles (4.218 kilometers), and therefore match, and Objects 5 and 6 are calculated to have a distance of 3.089 miles (4.971 kilometers), and therefore match. Thus, in this example, all six of the objects in this set are considered to match, even though Object 1 and Object 6 are calculated to be 14.469 miles (23.29 kilometers) apart, which is much greater than the user-specified threshold distance.
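The chaining effect in this example can be reproduced with a short sketch. It assumes the Table 1 coordinates, a 5 mile threshold, and a hypothetical rule under which an object joins the first group containing any member within the threshold. For brevity, the sketch does not merge two existing groups that a later object happens to bridge.

```python
import math

EARTH_RADIUS_MILES = 3963.16

# (latitude, longitude) of Objects 1-6 from Table 1.
TABLE_1 = {
    1: (42.36371, -71.055833),
    2: (42.354136, -71.055231),
    3: (42.33955, -71.135625),
    4: (42.307091, -71.216475),
    5: (42.300192, -71.266785),
    6: (42.304363, -71.326819),
}


def great_circle_miles(a, b):
    """Spherical law of cosines distance between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(a[0]), math.radians(b[0])
    delta_lon = math.radians(a[1] - b[1])
    c = (math.sin(phi1) * math.sin(phi2)
         + math.cos(phi1) * math.cos(phi2) * math.cos(delta_lon))
    return EARTH_RADIUS_MILES * math.acos(max(-1.0, min(1.0, c)))


def group_by_chaining(points, threshold):
    """Join a group if within the threshold of ANY object already in it."""
    groups = []
    for obj_id, point in points.items():
        for group in groups:
            if any(great_circle_miles(point, points[m]) < threshold for m in group):
                group.append(obj_id)
                break
        else:
            groups.append([obj_id])
    return groups


chained = group_by_chaining(TABLE_1, 5.0)
end_to_end = great_circle_miles(TABLE_1[1], TABLE_1[6])
print(chained)                  # a single group holding all six objects
print(end_to_end > 5.0)         # True: Objects 1 and 6 are far apart
```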

EXAMPLE 2

TABLE 2

#   First Name  Last Name  Address           City     State  Latitude  Longitude
7.  John        Smith      37 Manchester Ct  Newtown  MA     42.87456  −71.011475
8.  John        Smith      56 Bristol St     Newtown  MA     42.83556  −71.011175
9.  John        Smith      7 Newport Cr      Newtown  MA     42.81357  −71.023578

If Object 7 of Table 2 is compared with Object 1 of Table 1, the distance between the two objects is calculated to be 35.37 miles (56.92 kilometers). Similarly, Object 8 and Object 9 are calculated to have distances of 32.62 miles (52.49 kilometers) and 31.12 miles (50.09 kilometers), respectively, from Object 1. Thus, Object 7, Object 8, and Object 9 are not considered to be matching to Object 1, and are not associated with the first group identification number. However, since Object 7 and Object 8 are calculated to be 2.69 miles (4.34 kilometers) apart and Object 7 and Object 9 are calculated to be 4.26 miles (6.85 kilometers) apart, the objects of Table 2 meet the threshold requirement, and are associated with a second group identification number. Thus, the objects of Table 2 are matched and the same process applied above to the objects of Table 1 is employed here for the purposes of removing, editing, merging, or taking any other appropriate action to eliminate duplication of objects.

EXAMPLE 3

Using the above six objects shown in Table 1, Object 1 and Object 2 are compared, and are found to match. Each of Objects 1 and 2 is then associated with a first group identification number. The matching engine or other component then calculates a centroid (center point) of the locations associated with Objects 1 and 2, and associates that centroid with the first group. In this example, the centroid of the first group would have coordinates [42.358923, −71.055532].

Object 3 can then be compared with the first group's centroid to determine a distance between the centroid and Object 3. Object 3 is found to be 4.313 miles (6.941 kilometers) from the first group's centroid, and is associated with the first group identification number. Because a new object was associated with the first group, the centroid of the first group is then recalculated to account for all the objects in the first group. The new centroid of the first group would have coordinates [42.352465, −71.08223].

Each of Objects 4, 5, and 6 can be individually compared with the new centroid of the first group, and are found to have distances of 7.555 miles (12.16 kilometers), 10.118 miles (16.28 kilometers), and 12.959 miles (20.86 kilometers), respectively, which are all greater than the threshold distance. Objects 4, 5, and 6 are therefore not associated with the first group identification number.
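A minimal sketch of the centroid comparison in this example follows, assuming the Table 1 coordinates and a 5 mile threshold. The function names are hypothetical. Letting an unmatched object seed a new group, whose initial centroid is the object itself, is an assumption of the sketch that goes beyond what the example states.

```python
import math

EARTH_RADIUS_MILES = 3963.16

# (latitude, longitude) of Objects 1-6 from Table 1.
TABLE_1 = {
    1: (42.36371, -71.055833),
    2: (42.354136, -71.055231),
    3: (42.33955, -71.135625),
    4: (42.307091, -71.216475),
    5: (42.300192, -71.266785),
    6: (42.304363, -71.326819),
}


def great_circle_miles(a, b):
    """Spherical law of cosines distance between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(a[0]), math.radians(b[0])
    delta_lon = math.radians(a[1] - b[1])
    c = (math.sin(phi1) * math.sin(phi2)
         + math.cos(phi1) * math.cos(phi2) * math.cos(delta_lon))
    return EARTH_RADIUS_MILES * math.acos(max(-1.0, min(1.0, c)))


def group_by_centroid(points, threshold):
    """Join a group only if within the threshold of that group's current centroid."""
    groups = []  # each group keeps its members plus running latitude/longitude totals
    for obj_id, (lat, lon) in points.items():
        for g in groups:
            n = len(g["members"])
            centroid = (g["lat"] / n, g["lon"] / n)
            if great_circle_miles((lat, lon), centroid) < threshold:
                g["members"].append(obj_id)
                g["lat"] += lat
                g["lon"] += lon
                break
        else:
            # Assumption: an unmatched object starts a new group.
            groups.append({"members": [obj_id], "lat": lat, "lon": lon})
    return [g["members"] for g in groups]


centroid_groups = group_by_centroid(TABLE_1, 5.0)
print(centroid_groups[0])  # first group: Objects 1, 2 and 3 only
```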

To determine a threshold distance adjustment, a group of three or more points (n) 410, 420, 430, 440, 450 can be represented by a polygon of n points. See FIG. 4. The threshold distance adjustment analysis presumes a worst-case scenario, where every point is a maximum distance from every other point of the group. This can be modeled with a regular polygon having n sides of length D 470 (the user-specified threshold distance). The center of the polygon (C) 460 also represents the points' centroid. For example, as shown in FIG. 4, for 5 points, the worst-case length for perimeter lines ab, bc, cd, de and ea is D 470. From center point C, we wish to find the radius r 480, from C to any of points a, b, c, d, or e.

Using the logic discussed above, it was determined that the user-specified threshold distance must be adjusted to compensate for each additional object added to a group. The formula for this calculation is:

r = D / sin(π/n)

where r is the adjusted threshold distance and D is the user-specified distance. For n points, a point must be, at most, distance r from the centroid C, such that it is no greater than distance D from any other point. For 3 points in the above example, the adjusted threshold would be 5.77 miles (9.286 kilometers), for 4 points, 7.07 miles (11.38 kilometers), for 5 points, 8.506 miles (13.69 kilometers), and so forth.
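The adjustment can be computed directly from the formula as stated in the text. In the sketch below, the function name is hypothetical and the user-specified threshold D is taken to be 5 miles, as in the examples above.

```python
import math


def adjusted_threshold(user_threshold: float, n_points: int) -> float:
    """Adjusted threshold r = D / sin(pi / n), per the formula in the text."""
    if n_points < 3:
        raise ValueError("the adjustment is described for three or more points")
    return user_threshold / math.sin(math.pi / n_points)


adjusted = {n: adjusted_threshold(5.0, n) for n in (3, 4, 5)}
for n, r in adjusted.items():
    print(n, round(r, 2))  # 3 -> 5.77, 4 -> 7.07, 5 -> 8.51
```

The adjusted value grows with n, so a larger group tolerates a greater distance between its centroid and a candidate object.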

As used herein, and unless the context dictates otherwise, the term “coupled to” is intended to include both direct coupling (in which two elements that are coupled to each other contact each other) and indirect coupling (in which at least one additional element is located between the two elements). Therefore, the terms “coupled to” and “coupled with” are used synonymously.

It should be apparent to those skilled in the art that many more modifications besides those already described are possible without departing from the inventive concepts herein. The inventive subject matter, therefore, is not to be restricted except in the scope of the appended claims. Moreover, in interpreting both the specification and the claims, all terms should be interpreted in the broadest possible manner consistent with the context. In particular, the terms “comprises” and “comprising” should be interpreted as referring to elements, components, or steps in a non-exclusive manner, indicating that the referenced elements, components, or steps may be present, or utilized, or combined with other elements, components, or steps that are not expressly referenced. Where the specification or claims refer to at least one of something selected from the group consisting of A, B, C . . . and N, the text should be interpreted as requiring only one element from the group, not A plus N, or B plus N, etc.

What is claimed is:
1. A method for matching contact objects in a database using geographical proximity comprising: providing access to at least one contact database configured to store a plurality of contact objects comprising first, second, and third contact objects, wherein the first contact object has a first geographic coordinate, the second contact object has a second geographic coordinate, and the third contact object has a third geographic coordinate; providing access to a matching engine that is communicatively coupled to the at least one contact database; comparing the first and second geographic coordinates using the matching engine to determine a first geographical distance between the first and second geographic coordinates; associating a group number with each of the first and second contact objects and calculating a centroid geographic coordinate if the first geographical distance is less than a threshold distance; determining a second geographical distance between the third geographic coordinate and the centroid geographic coordinate; and associating the group number with the third contact object if the second geographical distance is less than the threshold distance.
2. The method of claim 1, wherein the threshold distance is at least five miles.
3. The method of claim 1, wherein the threshold distance is at least ten miles.
4. The method of claim 1, further comprising deleting the first contact object.
5. The method of claim 1, further comprising merging at least a portion of the first and second contact objects.
6. The method of claim 1, wherein each of the plurality of contact objects includes a last name attribute, an address attribute, a phone number attribute, and further comprising identifying a subset of contact objects from the plurality of contact objects based on at least one of the last name attribute, the address attribute, and the phone number attribute using the matching engine, wherein the subset comprises the first and second contact objects.
7. The method of claim 1, further comprising: calculating a second center point between the centroid and third geographic coordinates based on an average latitude value and longitude value of the centroid and third geographic coordinates; and associating the second center point with the group number.
8. The method of claim 7, wherein the plurality of contact objects comprises a fourth contact object having a fourth geographic coordinate, and further comprising: comparing the fourth geographic coordinate with the new center point to determine a third geographical distance; and associating the group number with the fourth contact object if the third geographical distance is less than the threshold distance.
9. The method of claim 1, wherein the matching engine is further configured to generate a matching interface configured to present contact objects associated with the group number.
10. The method of claim 9, wherein the matching interface is further configured to allow a user to modify the threshold distance.
11. A method for matching contact objects in a database using geographical proximity comprising: providing access to a contact database configured to store a plurality of contact records comprising first, second, and third contact records, wherein the first contact record comprises a first name and a first geographic coordinate, the second contact record comprises a second name and a second geographic coordinate, and the third contact record comprises a third name and a third geographic coordinate; providing access to a matching engine that is communicatively coupled to the contact database; comparing the first and second geographic coordinates using the matching engine to determine a geographical distance between the first and second geographic coordinates; deleting the first or second contact record from the database and calculating a centroid geographic coordinate if the geographical distance is less than a threshold distance; deleting the third contact record from the database if a second geographical distance is less than a threshold distance, wherein the second geographical distance is the distance between the third geographic coordinate and the centroid geographic coordinate.
12. The method of claim 11, wherein the threshold distance is at least five miles.
13. The method of claim 11, further comprising deleting the first contact record.
14. The method of claim 11, further comprising merging at least a portion of the first and second contact records.